Clustering large data sets described with discrete distributions and its application on TIMSS data set

Authors

  • Simona Korenjak-Cerne
  • Vladimir Batagelj
  • Barbara Japelj Pavesic
Abstract

Symbolic Data Analysis is based on special descriptions of data called symbolic objects. Such descriptions preserve more detailed information about the data than the standard representations with mean values. Representation with distributions is one special kind of symbolic object. In the clustering process, this representation enables us to consider variables of all types at the same time. We present two clustering methods based on data descriptions with discrete distributions: the adapted leaders method and the adapted Ward's agglomerative hierarchical clustering method. The two methods are compatible: they can be viewed as two approaches to solving the same clustering optimization problem. In the obtained clustering, a leader is assigned to each cluster. The descriptions of the leaders offer a simple interpretation of the clusters' characteristics. The leaders method enables us to efficiently solve clustering problems with a large number of units, while the agglomerative method is applied to the obtained leaders and enables us to decide upon the right number of clusters on the basis of the corresponding dendrogram. Both methods were successfully applied in analyses of different data sets. In the paper, an application to the TIMSS data set is presented. The descriptions with distributions enable us to combine two data sets, the answers of teachers and the answers of their students, into one data set. The descriptions of the obtained clusters enable us to interpret the results in a more understandable way.

Author affiliations: University of Ljubljana, Faculty of Economics, Department of Statistics (corresponding author); University of Ljubljana, Faculty of Mathematics and Physics, Department of Mathematics; The Educational Research Institute, Slovenia.
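The two-stage procedure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each unit is described by a single discrete distribution (one probability vector), uses squared Euclidean distance as the dissimilarity, and takes the mean distribution of a cluster as its leader. The function names `leaders_method` and `ward_on_leaders` and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np

def leaders_method(X, k, n_iter=100, seed=0):
    """Leaders (k-means-like) clustering of units given as probability vectors.

    X has shape (n_units, n_categories); each row is a discrete distribution.
    Assumption: squared Euclidean dissimilarity, leader = cluster mean.
    """
    rng = np.random.default_rng(seed)
    leaders = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign each unit to its nearest leader.
        d = ((X[:, None, :] - leaders[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        # Recompute each leader as the mean distribution of its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                leaders[j] = members.mean(axis=0)
    sizes = np.bincount(labels, minlength=k)
    return labels, leaders, sizes

def ward_on_leaders(leaders, sizes):
    """Agglomerate the leaders with Ward's criterion, weighted by cluster size.

    Returns merges as (id_a, id_b, height); merged clusters get ids k, k+1, ...
    """
    active = {i: (leaders[i].astype(float), float(sizes[i]))
              for i in range(len(leaders)) if sizes[i] > 0}
    next_id = len(leaders)
    merges = []
    while len(active) > 1:
        best = None
        ids = sorted(active)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                (la, na), (lb, nb) = active[a], active[b]
                # Increase in within-cluster squared error if a and b merge.
                h = na * nb / (na + nb) * ((la - lb) ** 2).sum()
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        (la, na), (lb, nb) = active.pop(a), active.pop(b)
        # The merged leader is the size-weighted mean of the two leaders.
        active[next_id] = ((na * la + nb * lb) / (na + nb), na + nb)
        merges.append((a, b, h))
        next_id += 1
    return merges

# Hypothetical data: 300 units, each a distribution over 4 categories.
rng = np.random.default_rng(1)
X = np.vstack([rng.dirichlet(alpha, size=100)
               for alpha in ([20, 2, 2, 2], [2, 20, 2, 2], [2, 2, 2, 20])])
labels, leaders, sizes = leaders_method(X, k=6)
merges = ward_on_leaders(leaders, sizes)
for a, b, h in merges:
    print(f"merge {a} + {b} at height {h:.4f}")
```

In this sketch the leaders step deliberately asks for more clusters than are expected, and the merge heights from the second step play the role of the dendrogram: a large jump in height suggests where to cut.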

Similar resources

Clustering Large Data Sets Described With Discrete Distributions and An Application on TIMSS Data Set

Symbolic Data Analysis is based on special descriptions of data called symbolic objects. Such descriptions preserve more detailed information about the data than the usual representations with mean values. Representation with distributions is one special kind of symbolic object. In the clustering process, this representation enables us to consider variables of all types at the same time. We pres...

Data clustering based on key identification

Clustering has been one of the main building blocks in the fields of machine learning and computer vision. Given a pairwise distance measure, it is challenging to find a proper way to identify a subset of representative exemplars and their associated cluster structures. The recent trend in big data analysis places more demanding requirements on new clustering algorithms, which must be both scalable and accura...

Skew-slash distribution and its application in topics regression

In many statistical modeling problems, the common assumption is that observations are normally distributed. In many real data applications, however, the true distribution deviates from the normal. Thus, the main concern of most recent studies on analyzing data is the construction and use of alternative distributions. In this regard, new classes of distributions such as slash and skew-sla...

A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach

In recent years, the advancement of information gathering technologies such as GPS and GSM networks has led to huge, complex data sets such as time series and trajectories. As a result, it is essential to use appropriate methods to analyze the large raw data sets produced. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...

The Family of Scale-Mixture of Skew-Normal Distributions and Its Application in Bayesian Nonlinear Regression Models

In previous studies on fitting non-linear regression models with the symmetric structure the normality is usually assumed in the analysis of data. This choice may be inappropriate when the distribution of residual terms is asymmetric. Recently, the family of scale-mixture of skew-normal distributions is the main concern of many researchers. This family includes several skewed and heavy-tailed d...

Journal title:
  • Statistical Analysis and Data Mining

Volume 4

Publication date: 2011